Variational Distribution
The Generalized Reparameterization Gradient
The reparameterization gradient has become a widely used method to obtain Monte Carlo gradients to optimize the variational objective. However, this technique does not easily apply to commonly used distributions such as beta or gamma without further approximations, and most practical applications of the reparameterization gradient fit Gaussian distributions. In this paper, we introduce the generalized reparameterization gradient, a method that extends the reparameterization gradient to a wider class of variational distributions. Generalized reparameterizations use invertible transformations of the latent variables which lead to transformed distributions that weakly depend on the variational parameters. This results in new Monte Carlo gradients that combine reparameterization gradients and score function gradients. We demonstrate our approach on variational inference for two complex probabilistic models. The generalized reparameterization is effective: even a single sample from the variational distribution is enough to obtain a low-variance gradient.
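To make the two-term structure concrete, here is a minimal numpy/scipy sketch for a gamma variational distribution. The shift-only standardization eps = log(z) - digamma(alpha) is a hypothetical simplification chosen for readability, not the paper's exact transformation (which also rescales); the resulting estimator combines a reparameterization term with a score-function correction:

```python
import numpy as np
from scipy.special import digamma, polygamma

# Toy check of a generalized-reparameterization-style gradient.
# q(z; alpha) = Gamma(alpha, 1) and f(z) = z**2, so E_q[f(z)] = alpha*(alpha+1)
# and the exact gradient w.r.t. alpha is 2*alpha + 1.
alpha, n_samples = 2.5, 200_000
rng = np.random.default_rng(0)

# Hypothetical invertible standardization: eps = log(z) - digamma(alpha),
# i.e. z = T(eps; alpha) = exp(eps + digamma(alpha)). The distribution of
# eps still depends (weakly) on alpha, so the gradient has two parts.
z = rng.gamma(shape=alpha, size=n_samples)
eps = np.log(z) - digamma(alpha)
trigamma = polygamma(1, alpha)

# Reparameterization term: d/dalpha f(T(eps; alpha)) = 2 * z**2 * trigamma(alpha)
rep_term = 2.0 * z**2 * trigamma
# Score-function correction: d/dalpha log q_eps(eps; alpha)
# = eps + trigamma(alpha) * (alpha - z), from q_eps(eps) = q_z(z) * dz/deps
score = eps + trigamma * (alpha - z)
g_hat = np.mean(rep_term + z**2 * score)

print(f"G-REP-style estimate: {g_hat:.3f}   exact: {2 * alpha + 1:.3f}")
```

Because the transformed distribution of eps still depends on alpha, neither term alone is unbiased; their sum is, and a run of the script can be checked against the closed-form gradient 2*alpha + 1.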
Variational Information Maximization for Feature Selection
Feature selection is one of the most fundamental problems in machine learning. An extensive body of work on information-theoretic feature selection is based on maximizing the mutual information between subsets of features and the class labels. Because mutual information is difficult to estimate, practical methods are forced to rely on approximations. We demonstrate that the approximations made by existing methods rest on unrealistic assumptions. We formulate a more flexible and general class of assumptions based on variational distributions and use them to tractably generate lower bounds on mutual information. These bounds define a novel information-theoretic framework for feature selection, which we prove to be optimal under tree graphical models with a proper choice of variational distributions. Our experiments demonstrate that the proposed method strongly outperforms existing information-theoretic feature selection approaches.
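The flavor of bound involved can be sketched in one derivation. The following is the standard variational (Barber–Agakov-style) lower bound, stated here as an illustration with an auxiliary distribution q(y|x); the bounds in the paper may take a more specific form:

```latex
% Variational lower bound on the mutual information between features X
% and label Y, via an auxiliary distribution q(y|x).
\begin{align*}
I(X;Y) &= H(Y) - H(Y \mid X) \\
       &= H(Y) + \mathbb{E}_{p(x,y)}\bigl[\log q(y \mid x)\bigr]
         + \mathbb{E}_{p(x)}\Bigl[\operatorname{KL}\bigl(p(y \mid x)\,\big\|\,q(y \mid x)\bigr)\Bigr] \\
       &\geq H(Y) + \mathbb{E}_{p(x,y)}\bigl[\log q(y \mid x)\bigr].
\end{align*}
```

Since the gap is an expected KL divergence, the bound is tight exactly when q(y|x) = p(y|x); restricting q to a tractable family (e.g., one respecting a tree structure) is what makes the bound computable while keeping the underlying assumptions explicit.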
Coupled Variational Bayes via Optimization Embedding
Variational inference plays a vital role in learning graphical models, especially on large-scale datasets. Much of its success depends on a proper choice of the auxiliary distribution class for posterior approximation. However, finding an auxiliary distribution class that achieves both good approximation ability and computational efficiency remains a core challenge. In this paper, we propose coupled variational Bayes, which exploits the primal-dual view of the ELBO with a variational distribution class generated by an optimization procedure, termed optimization embedding.
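For reference, a minimal statement of the ELBO that the primal-dual view starts from, in standard notation (x observed, z latent):

```latex
% Evidence lower bound (ELBO) on the log marginal likelihood.
\begin{equation*}
\log p(x) \;\geq\; \mathcal{L}(q) \;=\; \mathbb{E}_{q(z)}\bigl[\log p(x,z) - \log q(z)\bigr].
\end{equation*}
```

Equality holds exactly when q(z) = p(z | x). In coupled variational Bayes, as described above, the class of q is not fixed in advance but generated by an optimization procedure applied to this objective, coupling the variational family to the model.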
Appendices
Appendix A provides derivations supporting Section 3 in the main paper. In this section we provide detailed derivations of the ST-DGMRF joint distribution, for both first-order transition models (Section A.1) and higher-order transition models (Section A.2).

A.1 Joint distribution

The LDS (see Sections 2.2 and 3.1 in the main paper) defines a joint distribution over the system states. First, note that Eq. (1) can be written as a set of linear equations in the states. We make use of this property in the DGMRF formulation and in the conjugate gradient method. Eq. (11) is converted into a discrete-time dynamical system by approximating ρ.

We consider two ST-DGMRF variants that capture different amounts of prior knowledge; the DGMRF transition matrices can be parameterized accordingly. The air quality dataset is based on hourly PM2.5 measurements obtained from [ ]. The raw PM2.5 measurements are log-transformed and standardized to zero mean and unit variance. About 50% of the nodes are masked out (purple nodes in the corresponding figure). We use a simple MLP with one hidden layer of width 16 with ReLU activations and no output non-linearity. The DGMRF parameters are not shared across time, allowing for dynamically changing spatial covariance patterns.
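As a sketch of the rewriting referenced above, assume standard first-order LDS notation with transition matrix F_t, offset c_t, and Gaussian noise with precision Q_t (the paper's symbols may differ):

```latex
% First-order LDS and its equivalent form as a set of linear equations.
\begin{align*}
x_t &= F_t x_{t-1} + c_t + \varepsilon_t,
  & \varepsilon_t &\sim \mathcal{N}\bigl(0,\, Q_t^{-1}\bigr), \\
\varepsilon_t &= x_t - F_t x_{t-1} - c_t,
  & t &= 1, \dots, T.
\end{align*}
```

Stacking these equations over all time steps yields a joint Gaussian over the full state trajectory whose precision matrix is sparse and block-tridiagonal; this sparsity is the property exploited in the DGMRF formulation and by the conjugate gradient method.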